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Abstract 

In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively- 
parallel special-purpose computer for astrophysical TV-body simulations. GRAPE-6 is the successor of 
GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was 
the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though 
it can be used for collisionlcss systems. The main differences between GRAPE-4 and GRAPE-6 are (a) The 
processor chip of GRAPE-6 integrates 6 force-calculation pipelines, compared to one pipeline of GRAPE-4 
(which needed 3 clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 
MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements 
resulted in the peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6. 
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1. Introduction 

The TV-body simulation technique, in which the equa- 
tions of motion of TV particles are integrated numerically, 
has been one of the most powerful tools for the study of as- 
tronomical objects such as the solar system, star clusters, 
galaxies, clusters of galaxies and large-scale structures of 
the universe. 

Roughly speaking, the target systems for TV-body sim- 
ulations can be classified into two categories: collisional 
systems and collisionless systems. In the case of collisional 
systems, the evolution of the system is driven by two- 
body relaxation process, in other words, by microscopic 
exchange of thermal energies between particles. In this 
case, the simulation timcscalc tends to be long, since the 
relaxation timescale measured by the dynamical timescale 
is proportional to TV/logTV, where TV is the number of par- 
ticles in the system. 

The calculation cost of the simulation of collisional sys- 
tems increases rapidly as we increase the number of par- 
ticles TV, because of the following two reasons. First, as 
stated above, the relaxation timescale increases roughly 
linearly as we increase TV. This means the number of 
timestcps also increases at least lincarly(Makino & Hut 
1988). The second reason is that it is not easy to use 
fast and approximate algorithms such as Barnes-Hut tree 
algorithm(Barnes and Hut 1986) or the fast multipolc 



method(Greengard & Rokhlin 1987) to calculate the in- 
teraction between particles. Those imply that the cost per 
timcstep is 0(TV 2 ), and that the total cost of the simula- 
tion is <3(TV 3 ). 

There are two reasons why the use of approximate al- 
gorithms for the force calculation is difficult. The first 
reason is the need for relatively high accuracy. Since the 
total number of timcsteps is very large, we need a rather 
high accuracy for the force calculation. The other reason 
is the wide difference in the orbital timescale of particles. 
A unique nature of the gravitational TV-body problem is 
that particles interact only through gravity, which is an at- 
tractive force. This means that two particles can approach 
arbitrary close during a hyperbolic close encounter. In ad- 
dition, spatial inhomogeneity tends to develop, resulting 
in a high-density core and a low-density halo. Even on av- 
erage, particles in the core require much smaller timcsteps 
than particles in the halo do. 

It is clearly very wasteful to apply the same timestep to 
all particles in the system, and it is crucial to be able to 
apply individual and adaptive timestep to each particle. 
Such an "individual timestep" algorithm, first developed 
by Aarseth (1963; 1999), has been the core for practically 
any program that handles the time integration of colli- 
sional TV-body systems such as star clusters and systems 
of planetcsimals. 

The basic idea of the individual timestep algorithm is 
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to assign different times and timesteps to particles in the 
system. For particle i, its next time is fcj + Aij, where 
ti is the current time and Ati is the current timestep. 
To integrate the system, we first chose a particle with 
minimum + Ati and set the current system time t to be 
ti + Ati. Then, we predict the positions of all particles 
at time t and calculate the force on particle i. Finally, 
we correct the position of particle i using the calculated 
force, update U and determine the new timestep Ati. ln 
practice, we force the size of timesteps to be powers of 
two, so that the system time is quantized and multiple 
particles have exactly the same time. In this way, we can 
use parallel or vector processors efficiently, since we can 
integrate multiple particles in parallel (McMillan 1986; 
Makino 1991a). 

It is necessary to use the linear multistep method 
(predictor-corrector method) with variable stcpsizc for 
the time integration. Aarseth adopted an algorithm with 
third-order Newton interpolation. Recently, the method 
based on the third-order Hermite intcrpolation(Makino 
1991b; Makino & Aarseth 1992) has become widely used, 
because of its simplicity. 

In principle, it is not impossible to combine individual 
timestep algorithm and fast algorithms such as Barnes- 
Hut tree algorithm or FMM. McMillan and Aarseth 
(1993) developed such a combination, where the tree 
structure is dynamically updated according to the move of 
particles and force is calculated using multipole expansion 
up to octupolc. They assigned predictor polynomials to 
each node of the tree structure so that they could calculate 
the force from nodes to particles at arbitrary times. 

A serious problem with such a combination is that there 
is no known method to implement it on parallel comput- 
ers with distributed memory. It is not simple to achieve 
a good parallel performance with individual timestep al- 
gorithm, even without the tree algorithm. The reason is 
that simple methods require fast and low-latency commu- 
nication between processors. The recently proposed two- 
dimensional algorithm (Makino 2002) somewhat relaxes 
the requirement for the communication bandwidth, but it 
still requires low-latency communication. When combined 
with the tree algorithm, efficient parallclization becomes 
even more difficult. 

Distributed-memory parallel computers have been used 
to run large-scale cosmological simulations, with or 
without individual timestep algorithm(Dubinski 1996; 
Springcl ct al. 2001). In this case, we use simple spa- 
tial decomposition to distribute particles over processors. 
This works fine with large-scale cosmological simulations, 
where the distribution of particles in large scale is almost 
uniform. Many structures form from initial density fluc- 
tuations, and many small high-density regions develop. 
Even so, we can still divide the entire system so that the 
calculation load is reasonably well balanced. In addition, 
the range of the timesteps is relatively small. 

To parallelize the simulation of a single star cluster is 
much more difficult, because the calculation cost is dom- 
inated by a small number of particles in a single, small 
core(Makino & Hut 1988). Therefore, communication la- 



tency becomes the bottleneck, and it is difficult to paral- 
lelize the simple direct summation algorithm. As a result, 
no good parallel implementation of the combination of the 
tree algorithm and individual timestep algorithm exists. 
To really accelerate calculation of a single cluster, we need 
an approach different from what has been tried. 

There are three different approaches to improve the 
speed of any simulation: a) to use a faster computer, b) 
to use algorithms with smaller calculation cost, and c) to 
improve the efficiency of the algorithm used. Usually, op- 
tion (a) means to use commercially available fast comput- 
ers, which, at present, means distributed-memory parallel 
computers. An alternative possibility is to develop a com- 
puter by ourselves. We have been pursuing this direction, 
starting with GRAPE- 1 (Ito et al. 1990). 

The basic idea of the GRAPE (GRAvity piPE) architec- 
ture (Sugimoto et al. 1990) is to develop a fully pipelined 
processor specialized for the calculation of the gravita- 
tional interaction between particles. In this way, a single 
force-calculation pipeline integrates more than 30 arith- 
metic units, which all operate in parallel. In the cause of 
the Hermite time integration, we also need to calculate 
the first time derivative of the force, resulting in nearly 60 
arithmetic operations. This means that we can integrate 
a large number of arithmetic unit into a single hardware 
with minimal amount of additional logic. 

GRAPE- 1 was an experimental hardware with very 
short word format (relative force accuracy of 5% or so), 
and not really suited for simulations of collisional sys- 
tems. However, its exceptionally good cost-performance 
ratio made it useful for simulations of collisionless sys- 
tems (Okumura et al. 1991; Funato et al. 1992). Also, 
we developed an algorithm to accelerate the Barnes-Hut 
tree algorithm using GRAPE hardware (Makino 1991c), 
and developed GRAPE-1A (Fukushige et al. 1991), which 
was designed to achieve good performance with treecode. 
Thus, GRAPE approach turned out to be quite effective, 
not only for collisional simulations but also for collision- 
less simulations, and also for SPH simulations (Umemura 
et al. 1993; Steinmetz 1996). GRAPE-1A and its succes- 
sors, GRAPE-3 (Okumura et al. 1993) and GRAPE-5 
(Kawai et al. 2000) have been used by researchers world- 
wide for many different problems. 

In this paper, we discuss GRAPE-6, our newest ma- 
chine for the simulation of collisional systems. We briefly 
summarize the history of hardwares here. 

GRAPE-2 (Ito ct al. 1991) adopted usual 64- and 32- 
bit floating point number format, and could be used with 
Aarseth's NBODY3 program. After GRAPE-2, we devel- 
oped GRAPE-3 (Okumura et al. 1993), which is essen- 
tially an LSI implementation of GRAPE- 1. In GRAPE- 1, 
arithmetic operations were realized by fixed-point ALU 
chips and ROM chips, and in GRAPE-2 by floating-point 
ALU chips. Thus, we needed several tens of LSIs to re- 
alize a single pipeline. With GRAPE-3, we implemented 
a single pipeline to a single custom LSI chip, and devel- 
oped a board with 24 chips. In this way, we achieved the 
speed of 9 Gflops per board (24 chips each performing 38 
operations on 10 MHz clock cycle). 
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GRAPE-4(Makino et al. 1997) is similarly a single-LSI 
implementation of GRAPE-2, or actually that of HARP- 
l(Makino ct al. 1993), which was designed to calculate 
force and its time derivative. A single GRAPE-4 chip 
calculates one interaction in every three clock cycles, per- 
forming 19 operations. Its clock frequency was 32 MHz 
and peak speed of a chip was 608 Mflops. 

A major difference between GRAPE-4 and previous ma- 
chines is its size. GRAPE-4 integrated 1728 pipeline chips, 
for the peak speed of 1.08 Tflops. The machine is com- 
posed of 4 clusters, each with 9 processor boards. A single 
processor board houses 48 processor chips, all of which 
share a single memory unit through another custom chip 
to handle predictor polynomials. GRAPE-4 chip uses two- 
way virtual multiple pipeline, so that one chip looks like 
two chips with half the clock speed. Thus, one GRAPE-4 
board calculates the forces on 96 processors in parallel. 
Different boards calculate the forces from different parti- 
cles, but to the same 96 particles. Forces calculated in a 
single cluster arc summed up by special hardware within 
the cluster. 

In this paper, we describe the architecture and per- 
formance of GRAPE-6, which is the direct successor of 
GRAPE-4. The main difference between GRAPE-4 and 
GRAPE-6 is in the performance. The GRAPE-6 chip inte- 
grates 6 pipelines operating at 90 MHz, offering the speed 
of 30.8 Gflops, and the entire GRAPE-6 system with 2048 
chips offers the speed of 63.04 Tflops. 

The plan of this paper is as follows. In section 2, we 
describe the overall architecture, and in sections 3 and 4 
the details of implementation. In section 5, we discuss the 
difference between GRAPE-4 and GRAPE-6. In section 
6 we discuss the performance. Section 7 is for discussions. 
Those who are interested in how to use GRAPE-6, but not 
much in the design details, could skip section 2.1, most of 
section 3 and section 5. 

2. The architecture of GRAPE-6 

In this section, we give the overview of the architecture 
of GRAPE-6. What GRAPE-6 calculates are the follow- 
ing. First, it calculates the gravitational force, its time 
derivative, and potential, given by equations 
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where a^, a^, and <pi are the gravitational acceleration, its 
first time derivative, and the potential of particle i, rrii, 
Xj and Vi are the mass, position and velocity of particle 
i, G is the gravitational constant and e is the softening 
parameter. GRAPE-6 hardware assumes G = 1. If neces- 
sary, the host computer can multiply the result calculated 
by GRAPE-6 by some constant to use G other than one. 



Also note that potential is calculated without minus sign. 
Relative position and relative velocity and Vy are de- 
fined as 



Vij = v,- 



V,;. 



(4) 
(5) 

While calculating the force, it also evaluates the distance 
to the nearest neighbor 

r min = mi\ir lj , (6) 

and the value of index j which gives the minimum dis- 
tance. In addition, it constructs the list of neighbor par- 
ticles, whose distance squared (with softening, rfj +e 2 ) is 
smaller than pre-specified value lij. 

The position Xj and velocity v,,- of particles that ex- 
ert the forces are "predicted" by the following predictor 
polynomial 
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j-O + ^a^o + A^-a^o + v^o, (8) 



where x^p and Vj P are the predicted position and ve- 
locity, Xj o, v^o, BLj,o and a^o are the position, velocity, 
acceleration and its time derivative of particle j at time 
tjfi, and Atj is the difference between the current time tj 
of particle j and system time t, i.e., 

Atj=t-tj. (9) 

2.1. Individual timestep on GRAPE hardware 

Here, wc briefly summarize how GRAPE-6 (and 
GRAPE-4) works with individual timestep algorithm. For 
more detailed discussion, see Makino et al. (1997) or 
Makino & Taiji (1998). 

The time integration proceeds in the following steps 

a) As the initialization procedure, the host sends all data 

(position, velocity, acceleration, its first time deriva- 
tive, mass and time) of all particles to GRAPE mem- 
ory unit. 

b) The host creates the list of particles to be integrated 

at the present timestep. 

c) For each particles in the list, repeat the steps (d)-(g). 

d) The host predicts the position and velocity of the 

particle, and sends them to GRAPE. GRAPE 
stores them in the registers of the force calculation 
pipeline. It also sets the current time to a register 
in the predictor pipeline, 
c) GRAPE calculates the force from all other particles. 
Positions and velocities of other particles at the cur- 
rent time are calculated in the predictor pipeline. 

f) After the calculation is finished, the host retrieves the 

result. 

g) The host integrates the orbits of the particles and 

determines new timesteps. 

h) Update the present system time and go back to step 

(b). 
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Fig. 1. The top level network structure of GRAPE-6. "H" 
indicates a host computer and "PB" indicates a processor 
board. 

Here, the key to achieve good performance is to send 
only particles updated in the current timestep to GRAPE 
hardware. Thus, GRAPE hardware need to have the 
memory unit large enough to keep all particles in the sys- 
tem. This is usually not a severe limitation, since even 
with fast GRAPE hardwares, the number of particles we 
can handle with direct summation algorithm is not very 
large. 

2.2. Top-level network architecture 

The top-level architecture of GRAPE-6 is shown in fig- 
ure 1. It consists of 4 "clusters", each of which comprises 
16 GRAPE-6 processor boards (PB), 4 host computers 
(H), and interconnection networks. These 4 clusters are 
connected by Gigabit Ethernet. For host computers, we 
currently use PCs with AMD Athlon XP 1800+ CPU and 
SiS 745 chipset. Ethernet cards are 1000BT cards with 
NS 83820 single-chip Ethernet controllers. 

In the following, we will describe how we run parallel 
program on GRAPE-6. First, let us concentrate on the 
parallelization within a cluster. 

Figure 2 shows one cluster. Four processor boards are 
connected to a host computer through a network board. 
Four network boards are connected to each other, so that 
we can use a cluster as single unit or as multiple units. 

First consider the simplest case, where we just use 4 
hosts to run independent calculations. In this case, 4 pro- 
cessor boards connected to a host through one network 
board calculate the forces on the same set of particles, but 
from different set of particles [what we called j-parallclism 
in Makino et al. (1997)]. Each processor board stores the 
different subset of particles in the particle memory, and 
calculates the forces on the particles stored in the regis- 
ters in the processor chips. The partial forces calculated in 
different boards are sent in parallel to the network board, 
where they are added together by an adder tree. The host 
computer receives the summcd-up forces. As will be dis- 
cussed later, multiple processor chips on one board also 
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Fig. 2. A GRAPE- 
board. 



cluster. "NB" indicates a network 



have their local memories to store particles. They cal- 
culate the forces on the same set of particles, but from 
different sets of particles. The partial forces are summed 
up by the adder tree on the processor board. From the 
logical point of view there is no difference between a single- 
board system and multi-board system, as far as we use a 
single host. We can regard the entire system just as a 
huge adder tree with processor chips at all leaves. 

When all 16 boards and 4 hosts are used as a single 
unit, the particles are divided to 4 groups and each group 
is assigned to one host. Conceptually, the j-th board con- 
nected to host i calculates the force on particles in host i, 
from particles in host j. Summation of the partial forces 
is performed in the same way as in the case of single-host 
calculation. The only difference is that the data to be 
stored in the memory come from other hosts. 

In order to allow both single-host and multi-host calcu- 
lations, the network board must switch between broadcast 
mode (for the single-host calculation) and point-to-point 
mode (for the multi-host calculation). It would also be 
useful if we can use two hosts together. In this case, it is 
necessary to accept two inputs, and to pass each of them 
to two boards. Thus, we need three operation modes for 
the network board. One simple way to implement these 
three modes is shown in figure 3. Here, nodes A and B 
simply output the inputs from the left-hand side ports to 
two output ports. Nodes C, D, and E can select one from 
two inputs. In the case of node C, the selected input is 
sent to two output ports. 

This network can be configured in three ways. In the 
first mode, all nodes select input from the lower ports in 
figure 3. In other words, C takes input from input port 
2, D from input port 1, and E from input port 3. In 
this case, each GRAPE receives data from the input port 
with the same index. In this mode we can use this 4- 
GRAPE network as part of 4-host, 16-GRAPE system. 
In the second mode, node C selects the input from port 2, 
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Fig. 3. A 4-input example of switching network for parallel 
GRAPE. 

while D and E selects data from upper input port in figure 

3 (nodes B and C for nodes D and E. respectively). In this 
mode, GRAPEs and 1 receive the same data from port 
0. Similarly, GRAPEs 2 and 3 receive the data from port 
2. In other words, GRAPEs and 1 (and 2 and 3 as well) 
are effectively bundled together to behave as one system, 
and we can use this system as a part of 2-host, 8-GRAPE 
system. In the third mode, all nodes select upper inputs, 
thereby sending the data from port to all GRAPEs. In 
this way, we can use this 4-GRAPE network as a single 
system connected to one host. 

An important character of this network is that its hard- 
ware cost is 0(p), where p is the number of GR APE hard- 
wares. Thus, even for very large systems, the cost of the 
network remains small. By using this hardware network 
to send data from multiple host to processor boards under 
one host in parallel, we can improve the parallel efficiency 
quite significantly. 

There are many possible algorithms to parallelize the 
calculation over multiple clusters. Here we show just one 
example, which is a generalization of the "copy" algorithm 
(Makino 2002). In the copy algorithm, each node has 
the complete copy of the system. At each timestep, each 
node, however, integrates its own share of particles, which 
is either statically or dynamically assigned to it. After 
one step is finished, all nodes broadcast particles they up- 
dated, so that all nodes have the same updated system. 
In the case of multi-cluster calculation, each cluster has 
a complete copy of the system, which is distributed to 

4 hosts. For example, host of cluster and host of 
cluster 1 have the same data. In the time integration, cal- 
culation load is divided between all hosts in the different 
clusters with the same internal index. After one step is 
finished, updated data are exchanged again between hosts 
in different clusters with the same index. One could use 
"ring" algorithm or 2-D algorithm (Makino 2002), but 
for 4 clusters the difference in the performance is rather 
small. 

In principle, we could extend the network board to form 
8- input, 8-output switch, so that we can use all 64 boards 
as a "single cluster" . We decided not to do this since for 
many scientific applications we will use the system as a 
correction of single-host systems to run multiple simula- 
tions independently. To run multiple calculations, it is 
more efficient to have larger number of host computers. 
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Fig. 4. The structure of the processor board. 



2.3. board-level structure 

Figure 4 shows the structure of a processor board. It 
houses 8 processor modules. The processor board has one 
broadcast network that broadcasts data from the input 
port to all processor modules, and one reduction network 
that reduces the results obtained on 32 chips and returns 
it to the host through the output port. 

Each processor module consists of 4 processor chips each 
with its memory, and one summation unit. The structure 
of a processor module is the same as that of the proces- 
sor board, except that it has 4 processor chips instead of 
8 processor modules. Figure 5 shows the structure of a 
processor module. 

The memories attached to one processor chip can store 
up to 16,384 particles. Thus, a single board with 32 chips 
can handle up to 524,288 particles, for direct summation 
code with individual timestep. A 4 x 4-board cluster can 
handle up to 2 million particles. If one wants to use more 
than 2 million particles with direct summation, it is pos- 
sible to use the ring algorithm (see section 5.2). The cal- 
culation with 8 million particles is theoretically possible 
on a single cluster with 16 processor boards. 

In the next two sections, we present the detailed de- 
scription of the hardware, in a bottom-up fashion. In 
section 3, we describe the processor chip and in section 4 
the processor board, network board and interconnection. 

3. The processor chip 

The GRAPE-6 processor chip was fabricated using 
Toshiba TC-240 process (nominal design rule of 0.25/j,m. 
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Fig. 5. The structure of the processor module. 
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Fig. 6. The block diagram of the processor chip. 

The physical size of the chip is roughly 10 mm by 10 mm, 
and packaged into 480-contact BGA package. It operates 
at 90 MHz clock cycle. Power supply voltage is 2.5V. Heat 
dissipation is around 12 W at the maximum. 

A processor chip consists of six force calculation 
pipelines, a predictor pipeline, a memory interface, a con- 
trol unit and I/O ports. Figure 6 shows the overview of 
the chip. In the following, we discuss each block in turn. 

3.1. Force calculation pipeline 

The task of the force calculation pipeline is to evaluate 
equations (l)-(3). It also determines the nearest neighbor 
particle and its distance. This function is rather conve- 
nient for detecting close encounters or physical collisions 
between particles that require special treatments. For this 
purpose, the indices of particles that exert forces are sup- 
plied to the pipeline. 

The indices are also used to avoid self-interaction. The 
force calculation pipeline has the register for the index of 
the particle for which the force is calculated, and avoids 
the accumulation of the result if two indices are the same. 
This capability is introduced to avoid the need to send 
particles twice to the memory in the case of the individual 
timestep algorithms. 

With the individual timestep algorithm and the hard- 
wired predictor pipeline, the data of particles which ex- 
ert forces are evaluated by the predictor pipeline on chip, 
while the data for the particle for which the force is cal- 




Fig. 7. The block diagram of the force calculation pipeline. 

culated is evaluated on the host computer and sent to the 
register of the force calculation pipeline. These two values 
are not exactly the same, since the data format and ac- 
curacy of the hardware predictor are different from that 
of the host computer. GRAPE-4 pipeline did not have 
the logic to use the particle index, and the only way to 
avoid the self interaction was to make the data exactly the 
same. To achieve this, for the particles to be updated, we 
sent the predicted data at the current time to the memory 
as well as the registers. This means that we had to send 
j-particles twice per timestep. With the index-based ap- 
proach, we need to send j-particles only once per timestep, 
resulting in a significant reduction in the total amount of 
communication. 

For GRAPE-6 pipeline, we adopted the 8-way VMP 
[virtual multiple pipeline, Makino et al. (1997)], in which 
single physical pipeline serves as eight virtual pipelines, 
calculating the forces on 8 different particles. In this way, 
we can reduce the requirement for the memory bandwidth 
by a factor of 8, since all VMPs (and also physical multiple 
pipelines on a chip) calculate the forces from the same 
particle. 

In the physical implementation of the pipeline, we 
adopted several different number representations, depend- 
ing on the required accuracy. For input position data, 
we used 64-bit fixed point format. The reason we used 
the fixed point format here is to simplify the hardware. 
Additional advantage of using the fixed-point format is 
that the implementation of the periodic boundary condi- 
tion is simpler than that in the case of the floating-point 
data format (Fukushige et al. 1996). 

After first subtraction between two position vectors, the 
result is converted to floating-point format with 24-bit 
mantissa. Here, floating-point format is preferred, since 
othcrwize wc need very large multipliers. 

For the final accumulation, we return to the 64-bit fixed- 
point format, again to simplify the hardware. Here, we 
specify the scaling factor for each particle, so that we can 
calculate the forces with very different magnitude, without 
causing overflow or underflow. 

The pipeline for the calculation of the time derivative 
is designed in a similar way, but with 20-bit mantissa for 
intermediate data and 32-bit fixed-point format for the 
final accumulation. Since the time derivative is one order 
higher than the force, the required accuracy is lower. 
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Table 1. Arithmetic operations in force calculation pipeline 



# Operation Format length mantissa 



24 
24 

24 

24 

24 
20 
20 
20 
20 



Figure 7 shows the block diagram of the pipeline. It con- 
sists of arithmetic units to perform the operations shown 
in table 1 

We briefly discuss each operations below. 

3.1.1. dx <— Xj — Xj 

The position data are expressed in the 64-bit 2's comple- 
ment fixed-point format. The result of the subtraction is 
then converted to the floating-point format. The floating- 
point format used here consists of the sign bit, the zero 
bit, 10-bit exponent and 24-bit mantissa. The sign bit 
expresses the sign (one denotes a negative number). The 
zero bit indicates if the number is zero or not (one indi- 
cates that the expressed value is zero) . In standard IEEE 
floating-point format, a zero value is expressed in a spe- 
cial format (both exponent and mantissa are zero). This 
convention is useful to achieve the maximum accuracy for 
a given word length. However, using zero bit is more cost- 
effective in the internal expression for the hardware, since 
the logic to handle zero value is greatly simplified. 

For the result of subtraction, the range of exponent is 
6 bits. We use a biased format, and extend the exponent 
to 10 bits. For all floating-point operations, we use 10-bit 
exponents, to avoid overflows in intermediate results (in 
particular for r~ 5 ). 

The length of the mantissa is 24 bits, with usual "hid- 
den bit" at MSB (most significant bit). For the rounding 
mode, we used the "force-1" rounding, with the correc- 
tion to achieve unbiased rounding. With the "force-1" 
rounding, we always set the LSB of the calculated result 
(after proper shifting) to be one, regardless of the contents 
of the field below LSB. Thus, if the LSB is already one, 
the result is rounded toward zero, and if the LSB is zero, 
the result is rounded toward infinity. Thus, this rounding 
gives almost unbiased result. 

However, in this simple form this rounding is still bi- 
ased, since the treatment for the case where all bits below 
LSB are zero is not symmetric. Consider the following 
example, where we use 4 bits for mantissa and calculated 
result is in 8 bits (for example with multiplication). If 



the result is 10010000 in binary format, it is rounded to 
1001. If the result is 10000000, it is also rounded to 1001. 
Thus, out of 32 possible combinations of LSB bit and 4 
bits under LSB, for 16 cases the rounding is upward, 15 
cases downward, and one case no change. This gives slight 
upward bias for the rounded result. 

One way to remove this bias is not to force one if all 
bits below LSB is zero. This can be implemented with 
a rather simple logic, and we used this method with all 
floating-point arithmetic units used in GRAPE-6. 

Compared to the usual "round-to-the-nearest-even" 
rounding, this bias-corrected rounding is significantly eas- 
ier to design and test. In particular, there is no need for 
the conditional incrementer that would be necessary with 
the usual nearest rounding. Of course, this simpler de- 
sign does not mean smaller number of gates, since our 
rounding requires the length of the mantissa longer than 
that for the nearest rounding by 1 bit. However, it is also 
true that the additional number of gates is rather small, 
because we do not need the conditional incrementer. 

3.1.2. dr 2 s <- |dx| 2 + e 2 

These are realized with usual floating-point multipliers 
and adders. The design of the adder used here is simpler 
than that of general-purpose floating-point adders, since 
we know that both operands arc positive. This means that 
the result of the addition of the two mantissas is always 
larger than the larger of the two operands, and we need to 
shift the result at the maximum by one bit. With general- 
purpose adder, if two operands have similar magnitudes 
and different signs, the result of the addition can be much 
smaller, and we need a shifter with the capability to shift 
up the result by up to the length of the mantissa itself. 

3.1.3. calculation of r J a 

Here, we followed the design of the GRAPE-4 chip, 
where we used segmented second-order polynomial to cal- 
culate r§ = r,T 5 . We then multiply r§ by mj, and then by 
r 2 twice to obtain m^r" and rrij/r s . 

To calculate r~ 5 , we first normalize r 2 to the range of 
[1/4, 1). In other words, r s is expressed as 2 2a+b ■ c, where 
a is an integer number, b is either or 1, and c is in the 
range of [1/2,1). The "exponent" a is multiplied by —5 
to obtain the exponent of the resulted rj 5 . 

We used a table with 512 entries to obtain the coeffi- 
cients for the polynomial. This table accepts b and eight 
MSB bits (excluding the hidden bit) as the input address. 
The output of the table consists of the coefficients for 
the second-order term (12 bits), first-order term (18 bits), 
zeroth-order term (24 bits) and exponent (3 bits). Note 
that the calculated result is always smaller than the ze- 
roth order term, since both the first and second derivatives 
have minus signs. Therefore, the MSB of the calculated 
result can turn to zero, even though MSB of the zeroth 
order term is always one. In this case, we need to shift the 
result by one bit, and adjust the exponent by one. This 
adjusted exponent is then added to previously calculated 
exponent to obtain the exponent of the final result. 
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3.1.4- 4>%i = m j r s 1 

As described in the previous subsection, we actually cal- 
culate rrijr~ 5 and then multiply it by r 2 s twice. These mul- 
tiplications are usual floating-point multiplications, with 
the bias-corrected force-1 rounding. 

3.1.5. (f>i = 4>i + <f>ij 

The potential is accumulated in the 64-bit fixed-point 
format. The pairwise potential (f>ij, which is obtained in 
the floating-point format, is shifted before addition ac- 
cording to the shift length — s$ j j , where is the value of 
exponent of the pairwise potential <f>ij and s^^ is the scal- 
ing coefficient for the potential of particle i. Note that the 
coefficient s^.j is specified on the per-particle basis and we 
can specify different values for 48 virtual pipelines. This 
coefficient should be calculated from a reasonably good 
estimate of the total potential of particle i, to avoid both 
overflow and underflow during calculation. 



3.1.6. 



5 dx 



These are usual floating-point multiplications. 

3.1.7. &i = &i + ajj 

Here, we use the same design as that for the accumu- 
lation of the potential. The scaling coefficient is common 
for all three components. 

3.1.8. dv <— Vj -Vj 

This and the remaining operations to be discussed in 
this section are all for the time derivative of the forces. 
For these operations, we use the number format with the 
20-bit mantissa. For this first subtraction, the mantissa 
of input is 24 bits, and the result is given with the 20-bit 
mantissa. 

3.1.9. s = dv-dx 

This is an inner-product of two vectors in three dimen- 
sions. The mantissa of dx is first truncated to 20 bits. 

3.1.10. &i = &i + kij 

As in the case of the potential and the force, we use 
a fixed-point format with scaling coefficient for the final 
accumulation. Instead of the 64-bit format, however, here 
we used a 32-bit format, since only the contributions from 
nearby particles are important for the time derivative. 

3.1.11. Cutoff unit 

In figure 7, there are three boxes in "12-bit fixed" format 
region. These are used to implement the Gaussian cutoff 
of the 1/r potential, to be used with Ewald summation 
method for the calculation of the gravitational force with 
the periodic boundary condition. The details of the oper- 
ation of these boxes will be described elsewhere, with the 
discussion of the performance and accuracy of the Ewald 
method on GRAPE-6. In this paper, we can regard these 
two boxes, "Pcut" and "Fcut" as just boxes with constant 
(unit) outputs. 
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Fig. 8. The block diagram of the neighbor list unit. 

3.2. Neighbor list unit 

The neighbor list unit of GRAPE-6 chip is essentially 
the same as that of GRAPE-4 board. It consist of two 
memory units, one for the indices of j-particles and the 
other for flags to indicate (virtual) pipelines. One neigh- 
bor list unit serves 16 virtual pipelines (two physical 
pipelines). Thus we integrated three units to one chip. 
One neighbor list unit can store up to 256 neighbor par- 
ticles. 

Figure 8 shows one neighbor list unit. Each pipeline has 
registers (for each of the virtual pipelines) for the neighbor 
radius squared h 2 , and if the distance to the current j- 
particlc is not larger than the neighbor radius, a flag is 
asserted. This flag is stored to a shift register. Once per 
every eight clock cycles, this shift register contains the 
eight flags from different VMPs for the same j particles. 
At this cycle, if any of 16 flags from 16 virtual pipelines is 
asserted, the index of the current j-particle and the flags 
themselves are written to the memory. 

3. 3. Predictor pipeline 

The predictor pipeline evaluates the predictor polyno- 
mials expressed in equations (7) and (8). As stated ear- 
lier, we used 8-way VMP for the force calculation pipeline. 
Therefore, the predictor pipeline can use eight clock cycles 
to produce the predicted position and velocity of one par- 
ticle. To take advantage of this fact, we implemented one 
pipeline, which processes x, y, and z components sequen- 
tially In principle, we could further reduce the hardware 
size by using one pipeline for both position and velocity. 
We did not adopt this approach since the circuit size for 
the predictor pipeline was already a small fraction of the 
total size of the chip. 

In the design of the predictor pipeline, we tried to min- 
imize the amount of data to express the predictor for one 
particle, since it directly affects the time needed for com- 
munication and the number of wires needed between the 
memory chips and the processor chip. 

With GRAPE-4, the predictor data for one particle was 
expressed in 19 32-bit words (2 for time, 6 for position, 
3 for each of velocity, acceleration, and time derivative of 
acceleration, 1 for mass, and one for memory address). 
With a similar format, GRAPE-6 predictor would need 
23 words, since we need one more word for particle index 
and three more for the second time derivative of the ac- 
celeration. In many applications, inclusion of the second 
derivative improves the accuracy rather significantly. 
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Fig. 9. The block diagram of the predictor unit. 

To reduce the data length, we adopted the following two 
methods. First, for the particle time, instead of sending 
the time itself, we send the location of the bit below which 
the current system time i can be different from the particle 



Fig. 10. The block diagram of the logic to handle subtrac- 
tion of the time. 



3.3.1. At*-t~tj 

The current system time t is expressed in the 64-bit 
fixed-point format. The particle time tj is expressed by 



time tj. In this way, we could reduce the number of bits the bit location n above which t and U are the same. 



to express time from 64 to just 7. 

Second, we use a block floating-point format with man- 
tissa length optimized for each of the predictor coefficients. 
Thus, we used 32, 20, 16 and 10 bits, for velocity, accel- 
eration, first and second time derivatives, respectively. 

With these two changes, we could make one predictor 
data to be expressed in 16 32-bit words. Thus, we could 
use 64-bit memory bus with clock speed the same as that 
for the pipeline, to supply one particle data in every 8 
clock periods. 

Figure 9 shows the block diagram of the predictor unit. 
The predictor pipeline performs the following opera- 
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Here, p, is the output of i-th arithmetic unit of the predic- 
tor pipeline for the position, and qi is that for the velocity. 
Notations like (a (2) /18) mean the values corresponding 
the expressions are supplied from the memory unit. 

In the following, we describe operations a, b, c, d, j and 
q. Other operations are simple fixed-point addition, mul- 
tiplication, or multiplication by a constant implemented 
in the way similar to that for operations b, d and c. 



This location is the location of MSB of the timcstep 
Ati 



Consider the following example. If tj =0.5 and 
: 0.125, the current time t must be in the range of 



[0.5,0.625]. In this case, t — tj can be 



Atj 

[tj,tj + Atj 

calculated by simply masking all bits equal to or higher 
than MSB of Atj (i.e., 0.125 and above). This works 
for any value of t in the range [0.5,0.625). However, if 
t = 0.625, this procedure returns 0, but the correct value 
is 0.125. This is simply because we masked the bit which 
represents the exact value of Atj. This problem can be 
solved by supplying the value of tj at that bit. Unless t 
is equal to tj + Atj , values of this bit for t and tj are the 
same. In this case, the corresponding bit of the resulting 
At must be 0. However, if t is equal to tj + Atj, values 
of this bit for t and tj are different. In this case, MSB of 
the result must be one. Thus, by taking XOR of the two 
input bits, we can determine the MSB value of the result. 
Figure 10 shows the actual circuit. Here, a is the value of 
the bit of tj which corresponds to non-zero bit of Atj and 
n is the location of that bit. The result is expressed in a 
24-bit unsigned fixed-point format. Here, the rounding is 
simple rounding to zero. This can cause very small bias 
in the predicted position, if the timestep of the current 
blockstep is very small and the timcstep of the predicted 
particle is large. In this case, however, the error in the 
prediction does not degrade the accuracy. Therefore we 
do not perform rounding correction here. 

3.3.2. pi^-At-(a( 2 Vl8) 

Both inputs are supplied in a 10-bit fixed-point format. 
Here, we use the sign-magnitude format, instead of the 
usual 2's complement format, to simplify the design of the 
multiplier. Note that At is supplied in the 24-bit format. 
Therefore we need to truncate it to 10-bit format, using 
the bias-corrected force-1 rounding discussed earlier. The 
result is also rounded to the 10-bit format. 

3.3.3. p 2 <- Pi -0.75 

This multiplication by constant is achieved by adding 
Pi/2 and pi/4. These two values can be calculated by 
shifting them to the right by one bit and two bits, respec- 
tively. These shiftings, in hardware, require just wiring 
and no logic. Thus, this multiplication is actually im- 
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plcmcnted by a single adder. Here we do not round the 
result, since the effect of the error in the multiplication in 
the predictor is usually very small. 

3.34. P3^P2 + {a/6) 

Here, P2 is in the 10-bit sign-and-magnitude format, 
and (a/6) is in a 16-bit format. Thus, we first extend P2 
to 16 bits. If two inputs have different signs, we need to 
determine which one is larger. We do this by calculating 
both a — b and b — a. If a — b does not cause overflow, we 
can see that a>b, and use a — b as the result. The sign of 
the result is the same as that of a. This circuit is rather 
complicated and larger than that for the 2's complement 
format. However, the gate count is practically negligible. 

3.3.5. Operations (e)-(i) 

All these are usual fixed-point addition or multiplica- 
tion, implemented in the same way as operations (b) and 
(d). 

3.3.6. Xp^x + ps 

Here, we add two numbers in different formats. One is 
x in the 64-bit 2's complement fixed-point format. The 
other is p% in floating-point format with sign, exponent 
and mantissa. Since we do not perform any normalization 
during the calculation of pg , the mantissa is not normal- 
ized. This means that we do not use the hidden bit for pg. 
We first shift p s according to the value of the exponent 
of the velocity and then add it to (or subtract it from) x 
according to the sign bit. 

3.3.7. Operations (k)-(p) 

These operations are implemented in the same way as 
similar operations for the position predictor pipeline arc 
implemented 

3.3.8. v p <—v + q 6 

This is essentially the same addition as used in other 
operations, but here we post-normalize the result. For 
the output format we use a mantissa with the hidden bit. 

3.4- Memory interface 

The memory interface has two functions. The first one 
is to write the data sent from the host, and the second 
one is to read the memory during the calculation. 

The data of one particle is packed into 16 32-bit words. 
A data packet sent from the host consists of two control 
words and this 16-word data. The first control word con- 
tains following three fields: command code (2 bits), chip 
identity (10 bits) , and chip identity mask (10 bits). The 
second word is the starting address in the memory for the 
particle data. 

The chip identity field is used to select the chip that 
actually stores the particle data. With the design of 
GRAPE-6, all chips on one board, or on multiple boards 
connected to the same host, receive the same data from 
the host. We, however, have to let different chips calcu- 
late the forces from different particles, and this can be 
achieved by specifying, in the particle data packet, the 



Table 2. GRAPE-6 chip input port signal definition 



signal 


width 


description 


DATA 


36 


32 bit data with 4 bit parity 


WE 


1 


write enable 



identity of the chip that actually store the data. When a 
chip receives one j-particle data packet, it writes the data 
to the memory only if the chip identity field of the packet 
(masked by the identity mask) is the same as its identity 
register (also masked by the identity mask). The identity 
register itself must be all different on different chips, and 
how we achieve this will be discussed in the next subsec- 
tion . The identity mask field is usually all ones. 

The memory interface is designed to control two 
SSRAM (synchronous static random-access memory) 
chips with 36-bit data width. All signal lines drive only 
one chip, so that we can minimize the signal length. Using 
the combined data width of 72 bits, we implemented ECC 
(SECDED or single error correct and double error detect) 
for the data received from the memory. 

The memory interface is programmable, in the sense 
that practically all access latencies can be adjusted by 
writing to on-chip registers. Thus, we can use almost any 
type of SSRAM with different access timings. 

During the calculation, both memory chips output data 
at every clock cycle. The memory address counter is ini- 
tialized to 8N, where N is the number of particles, and 
decremented at each clock. For writing the data, we use a 
slower access, where we write two SSRAM chips at alter- 
nate clock cycles. In this way, we can reduce the switching 
noise and can also relax the timing requirement for the 
data bus. 

3.5. I/O ports and handshake protocol 

Tables 2 and 3 show the signal definition for input and 
output ports. Both ports operates on the clock with the 
frequency 1/4 of that of the internal logic and memory 
interface. As a result, the communication bandwidth is 
rather limited. However, the electrical design of the board 
is easier with a lower clock speed. Also, with the 32-bit 
data width, we can still achieve the data transfer speed 
of around 100 MB/s, which is fast enough to match with 
the speed of PCI bus of the host computer. 

The input port is very simple, with data lines and a 
single write enable line. The chip actually has two input 
ports, one dedicated to the data sent to the memory (we 
call this the JP port), and the other for everything else 
(the IP port). On the JP port, the data of one particle 
consists of 18 32-bit words, and the control logic handles 
this 18- word packet. The IP port is a general-purpose 
port. It accepts variable- length data packet. The first 
word of the packet is the starting address of the on-chip 
register. The second word is the number of data words to 
follow, and remaining words are all data. 

The output port is more complicated, because we need 
to implement flow control. The reason we need flow con- 
trol is that for some data, for example for the neighbor 
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Table 3. FO port signal description 



Signal 


direction (I/O) 


description 


D0-D35 





data (4 bits for parity) 


VD 





valid data 


ND 





new data 


STS 





status 


ACTIVE 





if 0, chip is unused 


WD 


I 


wait data 



list, the host must receive the data directly from all chips. 
In the case of the force, all chips output the results syn- 
chronously and the onboard reduction network reduces 
the data on the fly. However, the neighbor list data has 
to be transferred to the host without any reduction. 

It is possible to read neighbor data from each chip with- 
out using hardware flow control, by let the host computer 
send the commands to each chip sequentially until it re- 
ceives all data. In this case, the processor chip itself does 
not need any flow control. However, this procedure would 
be rather slow, since the host has to setup the DMA trans- 
fer many times. Therefore, we chose to let the host to send 
the command to all chips. The reduction network takes 
care of the flow control. In table 3, the WD signal is used 
for flow control. 

When the WD signal is asserted, the chip stops sending 
new data. When the chip sends a new data, it asserts both 
VD and ND signals. The VD signal is asserted as long as 
the data is valid, but ND is asserted only when the data is 
actually updated. The STS line is a special signal which 
tells if the force calculation pipeline is working or not. The 
ACTIVE signal is used to indicate defective chips. The 
output of this pin is programmable from the host, and 
if ACTIVE is negated, the reduction network ignores the 
output from the chip. 

4. Processor board and network hardware 

4-1. Processor module and processor board 

Figures 4 and 11 show the processor board. A single 
board houses 32 processor chips. Logically, the design of 
the board is rather simple. The input data is broadcasted 
to all chips, and the output data of the chips are reduced 
through a reduction network. 

The nodes of the reduction network are made of FPGA 
chips. It has two operation modes, reduction mode and 
pass-through mode. In the reduction mode, it receives the 
data from lower-level nodes (either the processor chips or 
lower- level FPGA nodes), and performs reduction. Since 
one particle data consists of force, potential, time deriva- 
tive of the force, the distance and index of the nearest 
neighbor particle, and status flags, the operation of the 
reduction ALU need to change according to the data type, 
and is controlled by a sequencer. 

In the pass-through mode, a node sends the data re- 
ceived from the lower-level node without applying any op- 
eration. Since multiple lower-level nodes might try to send 
the data simultaneously, every node controls the WD sig- 




Fig. 11. The processor board. 

nal (which is also implemented in a node FPGA as well) 
so that only one chip (or node) actually sends the data at 
one time. When one chip (or node) indicates the end of 
the data by negating the VD signal, the node negates the 
WD signal for the next chip to start receiving the data 
from that chip. 

As can be seen in figure 11, one processor board is de- 
signed to house up to eight processor "modules" . A single 
module houses 4 processor chips, 8 SSRAM chips, and 
an FPGA chip which realizes the 4-input, 1-output re- 
duction tree. We made this division between the board 
and module, to make the manufacturing easier. With this 
separation, all BGA chips (with large number of pins) are 
mounted on small-size module boards. Thus, the rate of 
the soldering error should be lower, compared to the case 
where we mount them on large boards. In addition, if 
there is an error, only a module with 4 chips would be 
defective. Of course, having to connect the board and 
module through a connector increases the probability of 
the failure, but we expected that the failure rate of the 
connector is significantly lower than the failure rate of the 
soldering (which turned out to be the case) 

The tree nodes are implemented using Altera ACEX se- 
ries chips. In the lowest level (processor module level), we 
used EP1K50A chips in 484-connect BGA packages. This 
chip implements a four-input node. Higher levels are im- 
plemented on EP1K30A chips in 208-pin QFP packages. 
This chip implements a 2-input node. These nodes arc on 
the processor board. 

The processor board is an 8-laycr standard PCB. The 
processor module board is an 11-layer board with inner via 
holes. The FPGA and processor chips are mounted on the 
top side, and SSRAM chips on the bottom side. By this 
layout, we can minimize the wire length between SSRAM 
chips and processor chips, and still achieve rather high 
packaging density. We used 4Mbit SSRAM chips. Two 
SSRAM chips connected to a processor chip can store up 
to 16,384 particles. One board can store up to 524,288 
particles. 

The SSRAM chips we chose requires 2.5V power supply 
for I/O and 3.3V for core. Both the processor board and 
module board have separate power planes for both 2.5 and 
3.3V power supply. 

Though the chip has separate ports for j-particles and 
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other data, for the board we decided to use a common data 
line, to simplify the design and reduce the manufacturing 
cost. 

Currently, the core of the processor chip operates on a 
90 MHz clock, and the I/O part on a 22.5 MHz clock, the 
reduction network and other logics of the control board 
operate also on the 22.5MHz clock. 

For board-board connection, we used a semi-serial 
LVDS signal. We used 4-wire (3 for signals and 1 for trans- 
mission clock) chipset, which performs 7:1 parallel-serial 
conversion. Since our basic transfer unit is a 32-bit word, 
we used two cycles of this chipset to transmit one data. 
Thus, the chipset operates on a 45 MHz clock, and the 
signal lines operate at the data rate of 315 MHz. For the 
conversion between the 22.5 MHz data rate of the board 
logic and the 45 MHz data rate of the LVDS chipset, we 
used additional FPGA chip. 

With this LVDS chipset, the receiver chipset itself is 
driven by the clock signal which comes with the data. 
In order to allow the two boards connected to a link to 
operate on independent clocks, we added FIFO chips after 
the data rate is reduced to 22.5 MHz. 

The physical form factor of the card is that of an 8U 
Eurocard (with the length of 400mm). For the backplane 
connection, we used connectors designed for Compact PCI 
cards. The power supply is also from backplane bus, 
through special power connectors. 

It is possible to connect a single processor board di- 
rectly to the host through the host interface card, without 
using the network card. For this purpose, the processor 
board also has the connector for the twisted-pair cable for 
the LVDS interface. These connectors are standard RJ-45 
modular jacks widely used for 10/100/1000BT Ethernet 
connection. Standard category 5 (or enhanced 5) cables 
can be used for connection. 

For LVDS interface chips, we used SN75LVD85 and 
SN75LVD86A chips from Texas Instruments. 

4-2. Host interface card 

Figure 12 shows the block diagram of the host interface 
card. It is a standard (32-bit, 33 MHz) PCI card. To 
transfer the data from host to GRAPE-6, the host setup 
the data to be transferred in its memory and let the PCI 
interface chip on the interface card perform DMA transfer. 
The data received by this DMA transfer is sent directly 
through the output link. In the design of the host interface 
card we implemented two output ports so that they can 
separately supply data to the JP and IP ports. As stated 
earlier, we decided to use only one port for the processor 
board. Therefore, the second output port of the interface 
card is not used. 

The input port is more complicated, with an FIFO 
memory to store the received data. This FIFO memory is 
necessary, since we cannot guarantee the response time of 
the host operating system to the DMA request from the 
interface card. We need to have the memory large enough 
to avoid any possible overflow. 

For the PCI interface, we used the 9080 chip from PLX 
technology. 
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Fig. 12. The host interface board. 
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Fig. 13. The network board. 



4-3. Network board 

Figure 13 shows the block diagram of the network 
board. It has two basic functions. One is to broadcast 
(or multicast) the data received from the host (or possibly 
higher-level network boards) to the processor boards (or 
lower- level network boards). This part is shown as IJP- 
UNIT. The other is the reduction network for the calcu- 
lated result, shown as FO-UNIT. The reduction network 
is exactly the same as that on the processor board, ex- 
cept for the fact that the interface to the module board 
is replaced by the interface for the processor board (with 
LVDS link chipset). 

The IJP-UNIT has 4 input ports. One of them is a spe- 
cial port designed to connect to the host. Other three 
ports are designed to accept data from other network 
boards (see figure 2). Each of the four input ports has 
a "copy" output, shown in the left-hand side of the unit, 
so that we can cascade multiple network boards. 

Figure 14 shows the block diagram of the multicast net- 
work. The boxes in the center of the figure are all data 
buffers with output enable control input, which realize the 
multicast network. Note that this structure implements 
the network logically equivalent to what is shown in figure 
3. 

The control input for these buffers are supplied from the 
control logic implemented on the FPGA for 45MHz-22.5 
MHz data rate change. This FPGA integrates a sequencer 
to decode IP/JP port data packets, which reacts to the 
address space assigned to the network board. 

4-4- Packaging and Power distribution 

In the standard configuration, eight processor boards 
and two network boards arc installed in a card rack with 
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Fig. 14. Physical implementation of multicast network. 

a special backplane for the LVDS link. The network board 
is a single- height unit, but the processor board occupies 
two- unit height, to allow sufficient airflow. 

To both of network boards and processor boards, the 
electrical power is supplied through backplane connectors. 
However, in our present packaging, each processor board 
has its own power supply unit. A power supply unit ac- 
cepts DC 330V input, and supplies DC 2.5V and 3.3V. 
The DC 330V power is generated by another power unit 
from three-phase AC power line. For all these power units, 
we used products from Vicor. 

We chose Vicor product primarily to reduce the re- 
sponse time of the power supply to the change in power 
consumption by the boards. One advantage of the CMOS 
logic is that it consumes power only when the logic state 
changes. This means that even though we had paid ab- 
solutely no effort to reduce the power consumption of 
the chip, its power consumption almost halves when the 
pipeline arc not active. 

This "feature" of the chip is rather good from the point 
of view of the running cost of the machine, but pauses 
a rather serious problem to the power supply. The typi- 
cal response timescale of a switching power supply units 
is of the order of one millisecond. On the other hand, 
GRAPE-6 switches between calculation and idle (or com- 
munication) states also in about one millisecond. This 
means that the response time of the power supply is too 
long to compensate for the change in the load between the 
calculation state and the idle state, and the supply voltage 
becomes rather unstable. Thus, we had to look for power 
supplies with a relatively short response time. For switch- 
ing power supplies, a short response time means high op- 
crating frequency, and Vicor products had the highest fre- 
quency among commercially available power units. 

Even with high-frequency power supplies, the response 
time was still the order of 100 microseconds, and the only 
way to stabilize the power supply is to add large bypass ca- 
pacitors. We attached capacitors with total capacitance of 
about 0.1F to the 2.5V power line of each processor board. 
We could not use usual alminium electrolytic capacitors 
because their internal resistance (equivalent series resis- 




Fig. 15. The 64-board, 4-cluster GRAPE-6 with the racks 
for host computers in front. 

tance, ESR) is too large. We used low-ESR electrolytic 
capacitors from Sanyo to meet our need. 

In hindsights, it would be probably better to design 
a small switching power supply unit integrated into the 
processor module, since such a power supply unit, which 
is used on every motherboard for PCs, is inexpensive and 
highly reliable. 

Figure 15 shows the complete GRAPE-6 system consist- 
ing of five racks (three with two subracks and two with one 
subracks), with 16 host computers in front of them. Host 
computers are Linux-running PCs, with AMD Athlon XP 
1800+ processors and ECS K7S6A motherboards. They 
are connected with Gigabit Ethernets. The total power 
consumption of the system is around 40 KW, when in full 
operation. 

5. Differences between GRAPE-4 and GRAPE-6 

As described in the previous sections, the architecture 
of GRAPE-6 is quite different from that of GRAPE-4, 
even though it is the direct successor of GRAPE-4 for 
essentially the same goal. In this section, we describe 
what design changes are made why. 

5.1. Differences in the semiconductor technology 

The primary difference is that for GRAPE-6 processor 
chip we used 0.25/mi design rule, while with GRAPE-4 
we used 1/im design rule. This difference with additional 
advance in wiring enables us to integrate roughly 20 times 
larger number of transistors, with 3-4 times faster clock 
speed. Thus, roughly speaking, a single GRAPE-6 chip 
offers the speed two orders of magnitude higher than that 
of GRAPE-4. 

This large advance, however, implies almost every de- 
sign decision had to be changed. In the following, we 
summarize the changes made. 

5.2. The host computer and overall architecture 

In GRAPE-4, 4 clusters are connected to a single host, 
sharing one I/O bus. For the peak speed of 1 Tflops, 
the single host was still okay for simulations with large 
number of particles (10 5 and larger), and communication 
through a single I/O bus was also okay. 
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Fig. 16. A simple parallel-host, parallel-GRAPE system. 

With GRAPE-6, however, the peak speed is increased 
by a factor of 60. On the other hand, the speed of a 
single host would be improved only by a factor of 10 or 
so, if we assume the standard Moore's law (performance 
doubling time of 18 months). Thus, if we want to achieve 
a reasonable speed for similar number of particles as that 
for GRAPE-4, we need to use around 10 host computers 
and the communication channel must be 10-20 times faster 
than that used for GRAPE-4. 

Around the time of the design, it was clear that a 
shared-memory multiprocessor system with 8-16 proces- 
sors and sufficient I/O bandwidth would be prohibitingly 
expensive, with the price tag of the order of 1 M USD. On 
the other hand, a cluster of 8-16 single-processor worksta- 
tions or PCs would be much less expensive. As far as the 
cost is concerned, clearly a cluster of single-processor ma- 
chines was better than a shared-memory multiprocessor 
system. 

One problem with the cluster is that the simplest con- 
figuration (see figure 16) does not work. The reason is the 
following. 

With this configuration, there are two different ways 
to distribute particle data over processors (Makino 2002). 
One is that each processor has the complete copy of the 
system (the "copy" algorithm). In this case, parallcliza- 
tion is performed as follows. At each blockstep, each 
processor determines which particles it updates. After 
all processors update their share of particles, they ex- 
change the updated particles so that all processors have 
the updated copy of the system. This algorithm has 
been used to implement the individual timestep algorithm 
on distributed-memory parallel computers (Spurzem & 
Baumgardt 1999) 

In this algorithm, at the end of the block timestep each 
processor receives the particles updated on all other pro- 
cessors. This means the amount of communication is in- 
dependent of (or, strictly speaking, is a slowly increasing 
function of) the number of processors, and the overall per- 
formance of the system is limited by the speed of commu- 
nication. 

The other possibility is to let each processor to have a 
non-overlapping subset of the system, so that one parti- 
cle resides only in one processor. In this case, with the 
blockstep algorithm we need to pass around the particles 
in the current blockstep, so that each processor can calcu- 
late the forces from its own particles to particles on other 
proccssors(thc "ring" algorithm). The amount of com- 



munication (host-host and host-GRAPE) per blockstep is 
again independent of the number of processors. This algo- 
rithm is also implemented on distributed-memory parallel 
computers with direct summation (Dorband et al. 2003) 
and even with the tree algorithm (Springel et al. 2001). 

For general-purpose parallel computers, this simple al- 
gorithm actually works rather well, simply because the 
calculation speed of single node is so slow. Even a cluster 
with several hundred nodes is still slower than a single 
GRAPE-4. So the communication speed of 10-100 MB/s 
is sufficient. However, with GRAPE-6 we do need a faster 
speed. 

Now we understand that it is possible to use a hybrid of 
the above two algorithm to solve the bottleneck (Makino 
2002). In this hybrid algorithm, we organize processors 
into two-dimensional grid, and distribute the particles so 
that each row (and each column) has the complete copy 
of the system. 

In the standard realization, this algorithm requires that 
total number of processors is r 2 , where r is a positive 
integer number. We divide N particles into r subsets, 
each with N/r particles. If we number processors from 
Pn to p rr , processor pij has the copy of both z-th and 
j-th subsets. 

At the beginning of the each blockstep, each processor 
selects the particles to be updated from subset i. Then 
all of them calculate the force on them from subset j. 
After that, the total forces can be calculated by taking 
summation over columns. Here, we assume the summed 
results are obtained on diagonal processors pa. 

After particles in the current block are updated on pa, 
they are broadcasted to all other processors in the same 
row (pxi) and also in the same column (pi X ) so that both 
subsets i and j are updated on each processor. 

In this algorithm, the amount of communication for one 
node is 0(N/r). In other words, the effective communi- 
cation bandwidth (both host-host and host-GRAPE) is 
increased by a factor r. Thus, the communication speed 
is improved by a factor proportional to the square root of 
the number of processors. 

At present, this solution looks fine, since the price of 
the fastest single-processor frontend is now rather cheap. 
The cost of the communication is also rather cheap, with 
Gigabit Ethernet adapters available for less than 100 USD 
per unit. 

When we started the design of GRAPE-6 in 1996, we 
did not expected such a drastic change in the price of 
fast frontend processors. At that time, RISC micropro- 
cessors were still several times faster than PCs with IA-32 
architecture, and 100Mbit Ethernet adapters were still ex- 
pensive. Thus, we had to come up with a design that did 
not need r 2 processors or fast host-host communication. 

It was not really difficult to come up with such a de- 
sign, since the only thing non-diagonal processors does is 
the force calculation. Instead of two-dimensional grid of 
host processors, we can construct a two-dimensional grid 
of GRAPE hardwares with orthogonal broadcast networks 
(figure 17). The GRAPE hardwares in the same row store 
the same data to their particle memories. When they cal- 
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Fig. 17. Two-dimensional network of GRAPE hardwares 
connected to one-dimensional host network. 

culate the forces, GRAPEs in the same column receive the 
same particles and calculate forces on them from particles 
in the memory. The calculated results on boards in the 
same column are then summed and returned to the host. 

One practical problem with this network architecture is 
that we cannot divide the system to smaller configurations 
so that we can run multiple programs. In the case of 
r 2 hosts, we can divide the system to any sub-squares, 
down to r 2 single host-GRAPE pairs. In the case of 2D 
hardware network, we do not have any such division. This 
problem can be partly circumvented by attaching a simple 
switching network before memory interface, so that they 
can select input. So we adopted the network structure 
shown in figure 3. 

In the final design of GRAPE-6, we actually adopted 
a hybrid of host-grid approach and GRAPE-network ap- 
proach, to make a reasonable compromise between the 
flexibility and absolute performance. Of course, this 
shift from the pure hardware network to hybrid one is 
made partly because we took into account the evolution 
of the host computers during the development period of 
GRAPE-6. It has become more cost effective to use large 
number of inexpensive (yet fast) computers as host than to 
have an elaborate hardware network to connect GRAPEs 
to small number of hosts. 

5.3. board-board connection 

GRAPE-4 consisted of 36 processor boards, organized 
in a two-stage simple tree network. Nine boards are 
housed in one rack, with one backplane bus. These boards 
are all connected to a control board, which broadcasts 
the data from the host to all processor boards and take 
the summation of the calculated data on nine processor 
boards. Since all boards are connected through a shared 
backplane bus, the control board has to access processor 
boards sequentially. In order to improve the data transfer 
rate, we used a wide data bus with the width of 96 bits. 

The connection between the control board and the host 
was a 32-bit parallel connection through a coaxial flat ca- 



ble. This connection is robust and reliable, but had three 
drawbacks: it was physically large, it was difficult to use 
long wires, and it was pretty expensive. Because a com- 
mon clock signal is used on the both side of the connection, 
the wire length is limited by the allowable signal skew, 
which means it is difficult to use fast clock (GRAPE-4 
used 16 MHz clock). 

A more practical problem is that board-board wiring 
would become too bulky and cumbersome, with hundreds 
of flat cables and nearly 10,000 contact points, if we use 
the same connection for GRAPE-6. In particular, it would 
be difficult to design the network board, since it needs to 
have more than 10 connectors. Also, it would be imprac- 
tical to use backplane to connect the network board and 
processor boards, since the number of pins on the network 
board would be too large. 

An obvious solution for this problem is to use a fast 
serial signal, such as the physical layer of the Gigabit 
Ethernet. At the time of our design decision, however, 
Gigabit Ethernet was unpractical because copper wire 
connection was not available in 1998. Optical connec- 
tion would be too expensive and would dissipate too much 
heat. 

We adopted what is called "LVDS Link" or "Flat Panel 
Display (FPD) link" , which uses four twisted-pair differ- 
ential signal lines (three for signals and one for clock). The 
reason we chose this interface was that inexpensive seri- 
alizer/deserializer chips were commercially available and 
that we could use standard category 5 shielded 4-pair ca- 
bles for 100Mbit Ethernet cable and its connectors for 
reliable data transmission, for the cable length of up to 
about 5 meters. 

Additional advantage of this choice is that we can use 
backplane connection (with custom-designed signal pat- 
tern) for connection between the network board and pro- 
cessor boards. Because the number of signals is small (8 
for one port), we can pack many ports into a standard 
backplane connector (we adopted Compact PCI connec- 
tor). 

5.4- Pipeline chip and memory interface 

The processor chip for GRAPE-4 had a single pipeline, 
which calculates the force on two particles in every six 
clock cycles (2-way VMP). During force calculation, the 
chip receives the data of one particle (position, velocity 
and mass) in every three external clock cycles, and the 
width of the input data bus was 107 bits. 

One GRAPE-4 board housed 48 pipeline chips, all of 
which receive the same particle data from the memory 
and calculate the force on two particles. This means that 
a single board calculates forces on 96 particles in parallel. 

This shared-memory architecture is simple to imple- 
ment. However, we could not use this architecture for 
GRAPE-6, since the hardware parallelism would become 
excessively large. The pipeline chip for GRAPE-6 would 
be roughly 50 times faster than that for GRAPE-4. Thus, 
even if we somehow increase the data transfer rate by a 
factor of 5, the number of particles on which the forces 
are calculated in parallel would increase by a factor of 10, 
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from 100 to 1,000. This number is too large, if we want 
to obtain a reasonable performance for simulations of star 
clusters with small, high-density cores. Note that with 
multiple-board configurations, this number would become 
even larger. On anrxr two-dimensional system, the de- 
gree of parallelism becomes larger by a factor of r. 

The data transfer rate of GRAPE-4 chip was about 200 
MB/s. To keep the degree of parallelism to be around 
100 or less, the GRAPE-6 chip would have to have the 
data transfer rate of 5 GB/s, which was well beyond our 
capability of designing and manufacturing. At 100 MHz 
clock, the speed of 5 GB/s requires 400 input pins. It is 
quite difficult to have 400 signal lines, all with 100 MHz 
data rate, to connect more than a few chips. 

Clearly, a different design was necessary. Too large de- 
grees of parallelism arose from our decision to let a large 
number of chips to share one memory unit. If we re- 
duce the number of chips to share the memory, thus, we 
can solve the problem. The extreme solution is to attach 
one memory unit to each pipeline chip, and let multiple 
pipelines to calculate the force on the same set of chips, 
but from different set of particles. 

This extreme solution has one important practical ad- 
vantage. The connection between the processor chip and 
its memory is point-to-point, and physically short (since 
we can put a processor chip and its memory next to each 
other). This means a high clock frequency, such as 100 
MHz, is relatively easy to achieve. 

To attach memory chips directly to the processor chips, 
we need to integrate the predictor pipeline and the mem- 
ory controller unit (generation of address and other con- 
trol signals) to the processor chip. These do not consume 
much transistors. Therefore it docs not have any effect to 
the performance of the chip. 

With GRAPE-6, we adopted a 72-bit (with ECC) 
data width for transfer between memory and the proces- 
sor chip. A GRAPE-6 chip integrates six 8-way VMP 
pipelines. Therefore it calculates the forces on 48 parti- 
cles in parallel. All pipelines on board calculate the forces 
on the same set of particles. Thus, even with the largest 
configuration we considered (an 8x8 system) , the degree 
of the parallelism is still less than 400, not much different 
from that of full-size GRAPE-4 (which was also 400). 

This change from shared memory design to local mem- 
ory design implied we had to take summation of large 
number of partial forces obtained on chips on one board. 
With GRAPE-4, we also had to take summation of forces 
obtained on different boards, and we used commercially 
available single-chip floating-point arithmetic units for 
this summation. With GRAPE-6, we could not apply 
this solution simply because such chips no longer existed. 
Thus, we have to either integrate this summation function 
into the processor chip, or develop another chip to take 
summation. 

We adopted the latter approach, but used FPGA (Field- 
programmable gate array) chips to implement adders. It 
was not impossible to integrate floating-point adders into 
FPGAs, but such a design would require rather large, ex- 
pensive FPGA chips and a complex design. In order to 



simplify the design, we chose to use a block floating point 
format for the force and other calculated result. In this 
format, we specify the exponent of the result before we 
start calculation. The actual value of exponent can be 
different for forces on different particles, so that we can 
calculate the forces with wildly different magnitudes in 
parallel. 

With this block floating point method, we can greatly 
simplify the design of the hardware to take the summa- 
tion. Of course, we have to supply the value of exponent, 
but the value of the exponent at the previous timestep is 
almost always okay. For the initial calculation, we some- 
times need to repeat the force calculation a few times until 
we have a good guess for the exponent. 

A rather important advantage of using the block float- 
ing point format is that the calculated result is indepen- 
dent of the number of processor chips used to calculate one 
force. Since the actual summations, both within the chip 
and outside the chip, are done in fixed-point format, no 
round-off error is generated during summation. Of course, 
round-off error is generated when we shift the calculated 
force to meet the block floating point format, but this er- 
ror is independent of the order in which the summation is 
performed. In the case of the usual floating-point format 
used in GRAPE-4, the round-off error generated in the 
summation depends on the order in which the forces from 
different particles are accumulated, and therefore the cal- 
culated force is not exactly the same, if the number of 
boards in the system is different. 

Of course, this difference does not have any effect on 
the accuracy of the simulation itself, since the word length 
itself is chosen as such. However, it is quite useful to be 
able to obtain exactly the same results on machines with 
different sizes, since it makes the validation of the result 
much simpler. 

6. Performance 

In this section, we discuss the performance of GRAPE- 
6 system, both for the direct summation algorithm with 
individual timestep and the tree algorithm. For both algo- 
rithms, we discuss the performance of single-host system 
and multi-host system. 

6.1. Direct summation with individual timestep 

Here we discuss the performance of GRAPE-6 for the 
individual timestep algorithm. As the benchmark run, 
we integrate the Plummer model with equal-mass parti- 
cles for 1 time unit (we use the "Heggie" unit, Hcggie & 
Mathieu 1986, where the gravitational constant G and to- 
tal mass of the system M are both unity and the total en- 
ergy of the system E is — 1/4) . We used standard Hermite 
integrator (Makino & Aarseth 1992) with the third-order 
predictor. Timestep criterion is that of Aarseth (Aarseth 
1999) with 7/ = 0.01. For softening parameter, we tried 
three different choices. The first one is a constant soften- 
ing, e=l/64. We also tried e=l/ [8 (27V) 1 / 3 ] ande = 4/A, 
to investigate the effect of the softening size. Note that 
for N = 256, all three choices of the softening give the 
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Fig. 18. CPU time in second to integrate a Plummer model 
for 1 time unit plotted versus the number of particles N. 
Solid, dashed and dotted curves indicate the result with con- 
stant, 1/7V 1 / 3 and 1/JV softenings, respectively. 

same value. In the following, we first describe the perfor- 
mance of a single-host system (with 4 processor boards). 
Then we discuss the performance of a single cluster with 
2 or 4 hosts, and finally we discuss the performance of 
multiple-cluster configurations. 

6.1.1. single-host performance 

Figure 18 shows the CPU time to integrate the system 
for one time units. We actually measured the CPU time 
for integration from time 0.25 to 1.0 and multiplied the 
result by 4/3, since the step size after the start of the inte- 
gration is too small because of the initialization procedure. 
From figure 18 we can see that the CPU time is almost 
proportional to N for N < 10 5 , but for TV-dependent soft- 
enings the dependence is slightly higher. For N > 10 5 , the 
slope approaches to 2. 

Figure 19 shows the actual calculation speed achieved. 
The theoretical peak speed of the single-host, 4-PB system 
is 3.94 Tflops. Here, we define the calculation speed as 

S = 57Nn steps , (10) 

where n steps is the average number of individual steps 
performed per second. The factor 57 means we count one 
pairwise force calculation as 57 floating-point operations. 
We took this number from recent literatures. From this 
figure, we can see that the achieved speed is practically 
independent of the choice of the softening. The reason 
why calculations with smaller softening takes more CPU 
time is that the number of timesteps is larger, as shown 
in figure 20. For calculations with iV-depcndcnt soften- 
ings, the number of block steps increases significantly as 
we increases N. This means that the average number of 
particles in one block grows rather slowly. However, as 
we can see from figure 19 this does not affect the achieved 
performance. 

Roughly speaking, we can model the calculation time 
per one particle step as follows: 

^single = (l—f)Thost + T coram + max(TcRAPE , fThost), ( 1 1 



Fig. 19. Same as figure 18, but calculation speed in Gfiops 
is plotted. 
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Fig. 20. Same as figure 18, but the number of total individ- 
ual steps (upper) and block steps (lower) are plotted. 

where T nos t is the time for the host computer to perform 
computations to integrate one particle, T comm is the time 
needed for the communication, and Tqrape is the time 
to calculate the force on GRAPE. The factor / is the frac- 
tion of the operations that the host computer can perform 
while GRAPE is calculating the force. The program we 
used tries to perform the time integration on host and the 
force calculation on GRAPE with as much concurrency as 
possible. 

We can estimate T comm as follows. The total amount 
of data transferred for one particle step is currently 200 
bytes. With the present host, the effective data transfer 
rate for DMA transfer is 80 MB/s. Therefore 

T comm = 200/(8 x 10 7 ) = 2.5 x 10 _6 sec. (12) 

The calculation time on GRAPE is expressed as 

Tgrape = N/(9 x 10 7 n pipes ) = 1.447 x 10- n JVscc,(13) 

where n P i pea is the total number of pipelines. With our 
current system n p i pes = 768. 

In figure 21, the solid curve shows the measured CPU 
) time per step. The dashed curve is a fit, with T nos t = 8.5 x 
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Fig. 21. CPU time per one particle step plotted as a func- 
tion of the number of particles N. Solid curve is the measured 
result. Dashed and dotted curves denote two different theo- 
retical estimates. 

10 _6 sec and / = 0. We can see that agreement between the 
theory and experimental result is good for large N, but is 
rather poor for small N. This is because we ignored the 
effect of the cache memory on T^ ost . The dotted curve 
is the theoretical estimate with a heuristic model for the 
cache effect. For this curve, we used 
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= 5.5 x l(T 6 c + 8.5 x 1CT 6 (1 

where c is expressed as 

1, (A < 1000), 

y/N/1000, (A > 1000). 

This model is purely empirical, but apparently gives a rea- 
sonable description for the performance. Since this effect 
of the cache is rather large, it turned out to be difficult to 
determine the value of / empirically. We assumed / = 0. 

For N < 1000, the experimental value is larger than the 
prediction of the refined theory. This is because the num- 
ber of particles in one block is too small. The overhead to 
invoke DMA operations becomes visible. 

Up to here, we discussed only the speed of a 4-PB sys- 
tem. Since there are many installation of GRAPE-6 out- 
side Tokyo university with one PB connected to a host, it 
would be useful to give the performance of smaller con- 
figurations. Figure 22 gives the estimated performance of 
4-, 2- and 1-PB system. One can see that performance 
difference is rather small for A < 3 x 10 4 . For A > 10 5 , 
performance difference becomes significant. 

6.1.2. multi-host performance 

Figure 23 shows the calculation speed for multi-host 
systems with up to 4 hosts. The peak speed of 2- and 
4-hosts systems are 7.88 Tflops and 15.76 Tflops, respec- 
tively. For up to 4 hosts, the network boards are used to 
distribute the data, and the communication network be- 
tween the host computers are used primarily for synchro- 
nization. The parallel program itself is written using MPI, 
and we used MPICH/p4 over TCP/IP as the MPI library. 
The network interface is Plancx GN-1000TC Gigabit NIC, 
which uses NS 83820 chip. We found the performance of 
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Fig. 22. The estimated performance of 4-, 2- and 1-PB sys- 
tems as the function of the number of particles N. Solid, 
dashed and dotted curves denote the speed of 4-, 2- and 1-PB 
systems, respectively. 

MPICH/p4 on this network interface to be quite unsatis- 
factory, and used UNIX TCP/IP socket system calls for 
actual communication. 

We can see that multi-host codes require rather large 
number of particles to achieve the speed faster than that 
of the single-host code. Even with the constant softening, 
the two-host code becomes faster than the single-host code 
only at A ~ 3000, and for e = 4/A, this crossover point 
moves to around A ~ 10 4 . 

Figure 24 shows the calculation time per one particle 
step for 4-node parallel calculation. The measured value 
is obtained by dividing the total number of particle steps 
by the wallclock time. This figure clearly shows why the 
value of A for the crossover is rather large. For "small" A 
(A < 10 4 ), the calculation time is inversely proportional 
to the number of particles A. This is because the com- 
munication between hosts, which takes constant time per 
one blockstep, dominates the total cost in this regime. To 
be quantitative, the calculation time per one particle step 
is expressed as 

(16) 



-^mn,p — ^single j V ~t~ ^comm, hosts: 



where p is the number of nodes in a cluster and T comm h os t s 
is the communication time expressed as 

Tcom, m ,hosts = 6(log 2 p + l)t sync /n b , (17) 

where (log 2 p + l)t sync is the time to complete a barrier 
synchronization for parallel code running on p nodes. The 
logarithmic factor comes from the fact that synchroniza- 
tion requires log 2 p+l stages. The divisor, rib, is the 
average number of particles integrated in one blockstep. 
For our current implementation of the synchronization, 
we found t sync = 250/xs. The factor 6 is the number of 
synchronization operations necessary in one blockstep. 

Theoretical estimates shown in figure 24 are calculated 
using equation (16). Here again, the agreement between 
the measured result and the theory with the effect of the 
cache memory of the host is very good. To evaluate Thost , 
we used N/p instead of A in equation (15), since one node 
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Fig. 23. The calculation speed in Gflops plotted as a func- 
tion of N. Solid, dashed and dotted curves show the results 
for 1, 2 and 4- node systems, respectively. The left panel shows 
the result for constant softening, and the right panel e = 4/N. 
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Fig. 24. Same as figure 21 but for the case of 4-node parallel 
calculation. 




Fig. 25. The calculation speed in Tflops plotted as a func- 
tion of N. Solid, dashed and dotted curves show the results 
for 4, 8 and 16-host (1, 2, and 4-cluster )systems, respectively. 
Constant softening is used for all runs. 

handles N/p particles. 

6.1.3. multi- cluster performance 

Figure 25 shows the calculation speed for multiple- 
cluster systems, as a function of the number of particles 
in the system N. The crossover point at which multi- 
cluster systems becomes faster than single-cluster system 
is rather high (N ~ 10 5 ), and even for = 10 6 , the 
speedup factors achieved by multi-cluster systems are sig- 
nificantly smaller than the ideal speedup. 

For multi-cluster system, the calculation time per one 
particle step can be estimated as follows. In our current 
implementation of the multi-cluster calculation code, one 
host of a p-hosts, g-clustcr system (therefore p/q hosts 
in a cluster) handles N/p particles. The forces from the 
particles in the hosts in the same cluster can be calculated 
using the hardware network on the side of the GRAPE- 
6. However, one cluster need to gather the information of 
particles on different clusters. By letting each of p/q hosts 
in one cluster receive data from other q—1 hosts, we can 
let one cluster maintain complete date of all N particles. 
This is just one of many possible implementations. For 
small value of q, theoretically, this is close to the best 
possible implementation. 

With this implementation, the calculation time per one 
particle step is expressed as 



Tmc,p — ^single / P ~l~ ^( 



comm.hosts 



T, 



comm. .cluster s j 



(18) 



where T, 



coram, c 



luste 



is the time for communication be- 



tween hosts in different clusters. It is expressed as 

Tcomm. clusters — 72 {/2,t coram ,net ~t~ t cornnl .grape*) ((? f)/P;(^) 



where t 



comtn.net 



and t. 



comm, grape 



are the time to send 1- 
byte date through the network interface of host and host- 
GRAPE interface, respectively. The constant factor of 72 
is the length of data for one particle in bytes. The next fac- 
tor of 2 comes from the fact that each node needs to both 
send and receive the data. The factor q—1 appears since 
one node receives data from q — 1 other nodes. We used 
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Fig. 26. Same as figure 21 but for the case of 16-host parallel 
calculation. 

tcomm.net — 1.7 X 10 S and t C omm, grape — 1.25 X 10 S. 

These values are based on separate measurement using 
small benchmark programs. Figure 26 shows the calcula- 
tion time per one particle step for full-cluster calculation 
(16 nodes, 4 clusters). The agreement between the theo- 
retical estimate and the measured value is fairly good, but 
not ideal. We probably have underestimated t commnet in 
the real program. 

6.1.4- Summary for the direct summation code 

In this section we presented the performance figures of 
GRAPE-6 for the direct A^-body simulation. As described 
in the introduction, this kind of simulations is the main 
target of GRAPE-6. So we made fairly detailed analysis 
of the performance. What we have found is summarized 
as follows. 

In the case of the single-host configuration, the calcu- 
lation speed of the present host computer is the largest 
bottleneck of the performance, and the communication 
speed is relatively unimportant. This means that we can 
keep improving the overall performance of the system just 
by replacing the host computer, for the next several years. 

For the multi-host configuration, the situation is rather 
different. In the case of a single cluster (no host-host data 
transfer), the performance for small- N runs is determined 
by the overhead of the barrier synchronization between 
host computers. We currently use standard UNIX imple- 
mentation of TCP /IP socket for the basic communication, 
and TCP/IP socket is certainly not the communication 
software with lowest possible latency. The use of com- 
munication software/hardware with lower latency would 
significantly improve the performance. 

Finally, for the case of multi-cluster configuration, as 
expected, the performance is limited by the bandwidth 
of the communication between hosts. Currently, we 
use Gigabit Ethernet card on 32-bit, 33-MHz PCI bus. 
Clearly, by going to faster bus (PCI-66 or PCI-X) and 
faster CPU, the communication bandwidth will be im- 
proved significantly. 

To summarize, at present the performance of GRAPE-6 
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Table 4. CPU time distribution for treecode 



Operation 


Time (sec, = 1) 


(0 = 0.5) 


Tree construction 


7.57 


7.57 


Force calculation 


18.40 


27.62 


Other operations 


1.86 


1.86 


Total 


27.83 


37.05 



for small- N calculations is limited by the speed of the host 
computer, and by the latency of the communication be- 
tween hosts when multi-host or multi-cluster systems are 
used. Even so, GRAPE-6 can achieve the speed exceeding 
100 Gflops, for relatively small number of particles such 
as 16k. In the coming several years, the improvement of 
the host computer will improve the overall performance of 
the system. 

6.2. Tree algorithm 

Here we discuss the performance of GRAPE-6 for 
Barnes-Hut tree algorithm. We used the modified al- 
gorithm introduced by Barnes (1990). We discuss the 
performance of single-host code and multi-host (parallel) 
code with up to 12 host computers. The parallel algo- 
rithm is based on the space decomposition similar to the 
well-known orthogonal recursive bisection (ORB) method 
(Dubinski 1996). The detail of the parallel algorithm will 
be discussed elsewhere. 

6.2.1. single-host performance 

Figure 27 shows the CPU time per timestep as a func- 
tion of the number of particle N . The distribution of 
particles is a Plummer model, with the outer cutoff ra- 
dius of 22.8 in Heggie units. We used n g = 20, 000 as the 
maximum group size for the modified algorithm. 

We can see that the CPU time grows practically linearly 
as we increase N . Also, the dependence on the opening 
angle is rather weak. This weak dependence is the char- 
acteristic of GRAPE implementation of the tree algorithm 
(Makino 1991c; Athanassoula et al. 1998). 

Table 4 gives the breakdown of the CPU time per step 
for calculation with TV = 2 21 . The average length of the in- 
teraction list was 1.01 x 10 4 and 1.69 x 10 4 for = 1.0 and 
0.5, respectively. The number of groups is 310 for both 
cases. As in the case of tree algorithm on older GRAPE 
hardwares, the performance is limited by the speed of 
the host and that of communication. Actual calculation 
on GRAPE-6 takes less than three seconds, for the case 
of = 0.5. The calculation on the host (tree construc- 
tion, tree traversal, and other calculations including the 
data conversion between GRAPE-6 internal format and 
floating-point format) count for roughly 2/3 of the remain- 
ing time, and actual communication 1/3. This, again, im- 
plies there is a rather large room for the improvement of 
the speed, just by moving to faster host computers. 

6.2.2. multi-host performance 

Since the performance of the single-host GRAPE-6 is 
limited by the speed of the host computer, an obvious way 
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Fig. 27. CPU time in seconds per timestep plotted as a func- 
tion of the number of particles iVoEa single-host configura- 
tion. Solid, dashed and dotted curves are for = 1, 0.75, 0.5, 
respectively. 




Fig. 28. CPU time in seconds per timestep plotted as func- 
tion of the number of host computers n p . Parallel tree algo- 
rithm is used with 6 = 1.0. Upper and lower curves are the 
result for 2 21 and 2 20 particles, respectively. Initial distribu- 
tion of the particles is a Plummer model. 

to improve the performance is to use multi-host systems. 

Figure 28 shows the performance of the parallel tree 
algorithm. The program used is a newly written one based 
on orthogonal recursive multi-section, a generalization of 
widely used ORB tree that allows a division to arbitrary 
number of domains in one dimension, instead of allowing 
only bisection. The primary advantage of this algorithm 
is that it can be used on systems with number of host 
computers not exactly a power of two. We measured the 
performance on 1, 2, 3, 4, 6, 8 and 12 hosts. 

The distribution of the particles is again the Plummer 
model. One can see that the scaling is again pretty good. 
12-hosts calculation is 9.3 times faster than single-host 
calculation. Parallel efficiency is better than 75%, even 
for relatively small number of particles shown here. 



7. Discussion 

7.1. Hindsights 

Though we regard GRAPE-6 a reasonable success, this 
certainly does not mean we did everything right. We did 
make quite a few mistakes, some of them affected the 
performance, some affected the reliability, some extended 
the development time, and some limited the application 
range. In the following we briefly discuss them in turn. 

7.1.1. Performance 

Concerning the performance, the largest problem with 
GRAPE-6 is that its clock frequency is somewhat below 
the expected value. The design goal (for the "worst case") 
was 100 MHz, while our actual hardware is currently run- 
ning at 90 MHz. With GRAPE-4, the design goal was 
33 MHz, and the machine operated without any problem 
at 32 MHz. The processor chip itself was confirmed to 
operate fine at 41 MHz. 

The primary reason for the low operating frequency is 
the problem with the stability of the power supply to the 
chip, or the impedance of the power line. Compared to 
the GRAPE-4 processor chip, GRAPE-6 processor chip 
consumes about two times larger power, at half the supply 
voltage. Thus, to keep the relative drop of the supply 
voltage to be the same, the impedance of the power line 
must be 1/8 of that of GRAPE-4. 

This is a quite difficult goal to achieve. Initially even 
the manufacturer of the chip did not fully appreciated how 
hard it was. As a result, the first sample of the chip could 
not operate correctly at the clock speed higher than 60 
MHz. The problem was that, when calculation starts, the 
power consumption of the chip increases by about a factor 
of two compared to that when the chip is idle. Because 
of the large total resistance of the power line in the LSI 
package and the power plane of the silicon chip itself, the 
supply voltage to the transistors decreases, and as a result 
their switching speed slows down. 

In the second design, the manufacturer came up with 
additional power plane and increased number of power 
and ground pins, which reduced the resistance signifi- 
cantly. However, the result was still rather unsatisfactory. 

The manufacturer was not alone in making this kind of 
mistakes. In the design of the processor board and the 
power supply, we also made similar mistakes. In the first 
design, we used traditional large switching power units 
with relatively low switching frequency This unit turned 
out to be unable to react to the quick change of the load 
between idle and calculation states. The normal elec- 
trolytic capacitors also turned out to be completely useless 
in stabilizing the power supply voltage. Thus, we need to 
redesign the power supply unit with high-frequency in- 
verters and low-ESR capacitors. 

In hindsights, we could have borrowed the design of 
power supply units for standard PC motherboards (for 
Intel processors), which were designed to meet quite sim- 
ilar requirements, but for an extremely low cost. The 
power supply circuit for typical PC motherboard would 
be good enough to support single module with 4 chips. 
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We then can supply 12 V to PCB. 

These apparently minor technical details are absolutely 
crucial for the manufacturing of high-performance com- 
puters. 

Another problem with the current GRAPE-6 chip is its 
limited I/O performance of only 90MB/s. As we stated 
earlier, this bandwidth is sufficient to keep the standard 
PCI interface busy, and it is not really the bottleneck, 
since, for many applications, the calculation on the host 
computer is more time-consuming. Even so, in a few years 
the I/O performance will become a problem. Additional 
problem is that host computers with faster PCI inter- 
faces (PCI64 and PCI-X) are now available. We cannot 
take advantage of these faster interfaces with the current 
GRAPE-6 design, because the I/O bandwidth of the pro- 
cessor chip is limited. We could have increased the I/O 
bandwidth of the chip without too much problem, by al- 
lowing the change in the ratio between the chip clock and 
board clock. With our current design, this ratio is effec- 
tively fixed to 4. 

Even with the current chip design, we could have in- 
creased the communication bandwidth of the processor 
board, without increasing that of the chip, by letting mul- 
tiple chips to transfer the data simultaneously. This possi- 
bility should have been considered, to increase the lifetime 
of the hardware. 

7.1.2. Reliability 

Since the GRAPE-6 system consists of exceptionally 
large number of arithmetic units, one might imagine that 
the primary source of the error is the calculation logic it- 
self. In practice, however, we have almost never seen any 
calculation error, once the power supply had become good 
enough. On the other hand, we saw quite a few errors in 
data transfer. 

We implemented ECC circuit for the memory interface 
of the processor chip, but only added parity detection cir- 
cuit to I/O ports. We thought this is reasonable, since 
memory ports operate on 90 MHz clock and I/O ports on 
22.5MHz. However, it turned out that memory parity er- 
ror almost never occur, while parity error for I/O occurs 
rather frequently. Since we do not exactly know the type 
of the error, it is not 100% clear whether the ECC capabil- 
ity would have helped or not. However, it is at least clear 
that more reliable data transmission would be better. 

A more serious problem with the reliability was very 
high defect rate for mass-produced processor boards and 
processor modules. Practically all failures were due to un- 
reliable soldering, and most of soldering problem turned 
out to be simply due to lack of skill of the manufacturer. 
This may be telling something about the present perfor- 
mance of Japanese high-tech industry. Even so, it is cer- 
tainly true that to manufacture a rather small quantity 
of PCB board is difficult. We could have designed either 
more automated test procedures for boards (with JTAG 
standard) or redundant connection. Yet another possibil- 
ity was to reduce the number of wires by using higher- 
frequency signals. 



7.1.3. Development time 

As we discussed in section 5, the use of the parallel 
host was inevitable. However, the use of the multicast 
network was not, at least in hindsight. We assumed that 
the price of high-end uniprocessor computers would not 
change greatly, and that the cost of high-bandwidth net- 
work adapters (1 Gb/s or higher) would remain high. In 
other words, we assumed that we could not afford to buy 
^100 fast host computers and to connect all of them by 
a fast network. Therefore, we designed our own network, 
which connects r host computers to r 2 processor boards. 
This approach worked fine, as we have seen in the previ- 
ous section. However, an alternative design, in which we 
connect each processor board to its own host computer, 
would have been much easier to develop. 

In 1997, the fastest systems are RISC-based UNIX 
workstations, with price higher than 20K USD. In 2003, 
systems based on the Intel x86 architecture offer the speed 
similar to that of the RISC based systems with highest 
performance, for the cost of less than 2K USD. TO illus- 
trate this, we use SPECfp (either 95 or 2000) numbers 
as representative of the performance. In 1997 the speed 
difference between RISC systems and x86 systems were 
nearly a factor of three. This ratio had been almost con- 
stant during 1990s. The reason why the ratio started to 
shrink is simply that the rate of improvement of the per- 
formance of RISC-based systems slowed down. Thus, it 
would have been difficult to predict the present state in 
1997, or even in 1999. In other words, even though now 
it is clear that network hardware of GRAPE-6 is not nec- 
essary, until 2000 we had no other choice. 

Another reason for the rather long development time 
(aside from the problems with the power supply) is the 
fact that we integrated effectively all functions of the sys- 
tem to the processor chip. It integrates the memory con- 
troller, the predictor pipeline, and all other control logics. 
Except for the predictor pipeline, all these logics were im- 
plemented with FPGAs on GRAPE-4. This integration 
simplified the design of the board, and fortunately, we 
have not made any serious mistake in the design of these 
parts. The integration of these complicated logics onto 
hardware required extremely careful design and test pro- 
cedures which were time-consuming. With the present 
price of moderately large FPGA chips, all of these con- 
trol logics could be implemented using FPGAs, with very 
small additional cost. Of course, such moderately large 
and inexpensive FPGA was not there when we decided 
on the design of GRAPE-6 chip. However, we could have 
predicted the direction of the evolution of FPGA chips 
and estimated the price of them. 

7.1.4- Application range 

Since GRAPE-6 is designed solely for the gravitational 
iV-body problem, one might think there is not much of 
the range of applications. However, even within iV-body 
simulations, there are many factors. 

The overall design of GRAPE-6 is highly optimized to 
parallel execution of the direct force calculation with the 
individual timestep algorithm. This of course means it is 
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not optimal for other applications, such as tree algorithm 
and SPH calculations of self-gravitating fluid. 

With the case of the tree algorithm, the performance is 
limited mainly by the speed of the host computer. So, in 
this case, adding more host computers would have greatly 
improved the performance. 

In principle we could have improved the performance 
of the tree algorithm in several other ways. One obvi- 
ous approach is to reduce the data to send. With tree 
algorithm, we would not use the predictor. Moreover, we 
would not need the full 64-bit resolution for the position 
data. Thus, we could have implemented some way to re- 
duce the data to send for j-particles, if our memory con- 
troller was not implemented in hardware. Actually, the 
memory controller of GRAPE-6 has some programmabil- 
ity However, one "feature" of this memory controller pre- 
vented us from taking full advantage of this programma- 
bility to reduce the amount of the data transfer. 

With an FPGA implementation of the memory con- 
troller, we could implement other ways to further reduce 
the communication. For example, we could implement in- 
direct addressing, so that we can send indices of j particles 
instead of sending their physical data. 

Concerning the design of the pipeline, one thing which 
might have been useful for simulation of collisionless sys- 
tems or composite A-body+SPH systems is the ability to 
apply different softening length on different particles, in 
a symmetrized way. This can be achieved by calculating 
the softened distance as 

r* = r&+e? + 4 (20) 

The pipeline will need one more addition, which is rela- 
tively inexpensive. 

With SPH, the main problem is that the calculation 
of SPH interactions itself cannot be done on GRAPE- 
6. The PROGRAPE system(Hamada et al. 2000), with 
the calculation pipeline fully implemented in FPGA, could 
be used to perform the calculation of SPH interaction. 
Moderately large PROGRAPE system is currently under 
development. 

With the logic design of the pipeline, we have noticed 
a few problems which we could not foresee. One is the 
length of the accumulator for the time derivative of the 
force. For the force and potential, we used 64-bit accu- 
mulators, but for the time derivative we used 32-bit accu- 
mulators. As far as the accuracy is concerned, this length 
is long enough. However, when we performed simulations 
with large number of particles, we realized that the over- 
flow occurred rather frequently The reason why the over- 
flows occurs is that the magnitude of the time derivative of 
the force can change by a large factor in a single timestep. 
The large change occurs when the previous value happens 
to be almost zero. We could circumvent this problem with 
a combination of guess for the likely value of the time 
derivative of the force based on the value of force and 
timestep, but it is cumbersome to implement and expen- 
sive to evaluate. By increasing the accumulator length to, 
say, 40 bits, we could have almost completely eliminated 
the overflow. This overflow does not have any noticeable 



impact on performance. But the need to handle overflows 
made the interface program rather complicated. 

7.2. GRAPE-7/8 

Given that GRAPE-6 is now completed and we already 
have the experience of running it for almost two years, 
it would be natural to put some thought on how its suc- 
cessor will look like. In this section, we first discuss the 
change in the technologies, and then overview the design 
possibilities. 

7.2.1. Technological changes and the basic design 

Compared to the technology used in GRAPE-6, what 
will used in the next GRAPE system (we call it NGS for 
short) will be different in 

(a) Semiconductor technology 

(b) development cost 

(c) Host I/O bus 

First let us discuss the semiconductor technology. 
GRAPE-6 used 0.25^m technology, while NGS would use, 
depending on the time to start, either 130nm or 90nm 
technology. Since it seems we are not getting the bud- 
get too soon, we will probably use 90nm. This means we 
can pack about 8 times more transistors to the chip of 
the same size, and the switching speed will be about 3 
times faster. Thus, a single chip of the same size can of- 
fer 20 times more computing power. If the power supply 
voltage is reduced by the same factor, the power consump- 
tion would remain the same, but most likely the supply 
voltage would be somewhat higher, resulting in significant 
increase in the power consumption. 

To express in concrete numbers, a single chip would in- 
tegrate around 50 pipeline processors, each with 60 arith- 
metic units operating on 300 MHz clock speed, with 1.2V 
supply voltage and power consumption of 20 W. The the- 
oretical peak speed of the chip will be around 600 Gflops. 

Compared to the projected speed of general-purpose mi- 
croprocessors in, say, 2007, this speed is quite attractive. 
In 2007, microprocessors will, at best, have the peak speed 
10 times faster than they have now, or about 30 Gflops. 
Typical performance on real application would be around 
10 Gflops or less, for the power consumption of 100 W or 
more. 

A necessary consideration is how we connect the 
pipelines to memory. If we use the same memory sys- 
tem as we used for GRAPE-6, the total number of virtual 
pipelines per chip becomes 1,000, which is too large for 
a simulation of any collisional system. As was the case 
with GRAPE-6, it is necessary to keep the number of i- 
particles calculated in parallel to be around 500 or less, 
for large systems with many chips. So the number of vir- 
tual pipelines per chip must be less than 200, or ideally 
less than 100. In other words, the memory bandwidth 
must be increased by at least a factor of five, to around 
3.5 GB/s. 

This number, by itself, sounds relatively easy to achieve. 
It is the same as what was used with the first Intel P4 
processor (3.2GB/s), using two DRDRAM channels each 
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with 16-bit data width. Intel P4 has been around for 
more than two years. Now we can also use DDR 400 
memory chips, which have 4 times more throughput than 
the SSRAM chips used in GRAPE-6. We could also use 
DDR SRAMs. 

The choice of the memory interface has strong impact 
on the range of the applications. One major limitation of 
GRAPE-6 was that, as was discussed in the previous sec- 
tion, its memory addressing scheme was limited only to 
the sequential access to a full set of predictor data. Thus, 
it is not easy to use the tree or other sophisticated algo- 
rithms efficiently on GRAPE-6. One possibility to solve 
this problem is to implement the memory controller and 
other control logics in an FPGA chip. The connection 
between the FPGA chip and the pipeline chip must be 
quite fast, but this is relatively easy to achieve since the 
data transfer is unidirectional, from the FPGA chip to the 
pipeline chip. The memory controller will be implemented 
in the FPGA chip. Thus, it will be possible to use differ- 
ent types of memory (DRDRAM, DDR DRAM/SRAM) 
without any need to change the pipeline chip. 

As we discussed earlier, parallelism will be achieved by 
two-dimensional network of host computers. Each of them 
will have a relatively small GRAPE system. As an exam- 
ple, we consider a system with 256 host computers each 
with two GRAPE cards. Each card houses 4 processor 
chips with their own memory control units and memories. 
All of them can be packaged into single card of the PCI 
form factor, though we need special care for the power 
supply. 

For the interface to the host, the easiest solution is to 
use PCI-X, which is available now with the data trans- 
fer speed of up to 1 GB/s. PCI-X gives us an order-of- 
magnitude increase in the communication speed, which 
roughly balances the increase in the performance of a fac- 
tor of 20. One problem is whether or not PCI-X will be 
around 5 years from now. We need to predict the market 
trend, or develop a design that can use multiple interfaces. 

Note that this factor-of-10 increase in the communi- 
cation implies that the chip-to-chip communication must 
also be faster by the same factor. This is not easy, but 
since the physical size of the board will be much smaller, 
it would not be impossible to use fast clocks. 

Thus, the design of NGS seems to be simple, as far as 
we set the parallel execution of the individual timestep 
algorithm with direct summation as the primary design 
target. 

The only, but quite serious, problem is that the pre- 
dicted initial cost for the custom chip will be very high. 
The initial cost for a custom chip has been increasing quite 
steeply. Roughly speaking, the initial cost has been pro- 
portional to the inverse of the design rule. Thus, while the 
initial cost of the GRAPE-4 chip was around 200K USD, 
that for GRAPE-6 exceeded 1M, and for NGS it will reach 
3M. Even though this is "small" compared to the price of 
any massively-parallel supercomputer or even PC clusters, 
to get a grant of this size within the small community of 
theoretical astrophysics in Japan is not easy. 



7.2.2. Combination with sophisticated algorithms 

One rather fundamental question concerning the next 
GRAPE system is whether the direct summation is really 
the best solution or not. McMillan and Aarseth (1993) 
have demonstrated that it is possible to implement a com- 
bination of the Barnes-Hut tree algorithm and the individ- 
ual timestep algorithm that runs efficiently at least on a 
single-processor computer, and potentially also on shared- 
memory parallel computers. Even when we require very 
high accuracy, the gain by tree algorithm is large for large 
N. For example, the number of interactions per particle 
to achieve the relative force accuracy of 10 is around 
8,000 when quadrupole moment is used and around 2,000 
when octupole moment is used, for number of particles 
around 10 6 . Thus, even if we assume the calculation of 
octupole costs a factor of 10 more than point-mass force, 
the calculation cost of the tree algorithm would be a factor 
of 50 less than that of the direct calculation. 

Even though the scaling is not as drastic as that 
of the tree algorithm, the Ahmad-Cohen scheme (1973, 
also known as the neighbor scheme) offers quite signif- 
icant reduction of the calculation cost over the simple 
direct summation. The theoretical gain in the calcula- 
tion cost is 0{N 1 / i ) for the neighbor schemc(Makino & 
Hut 1988; Makino & Aarseth 1992). However, the actual 
speedup is nearly a factor of 10, for only 1,000 particles, 
thus, for 10 6 particles the gain can reach a factor of 50. 

For 10 6 particles both the tree algorithm and neigh- 
bor scheme, at least theoretically, offer the reduction in 
the calculation cost of around a factor of 50. This fac- 
tor is certainly still smaller than the advantage of the 
GRAPE hardware over general-purpose computers, since 
the difference in the price-performance ratio will exceed 
10 3 . However, if we can incorporate either of these so- 
phisticated algorithms, even with significant loss in the 
hardware efficiency like a factor of 5 or even more, we can 
still achieve a very significant improvement in the over- 
all speed. We are currently investigating several possible 
ways to achieve this goal. 
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